The main task of this project is to develop a demo version of image search by text query.
For the demo version we need to train a model that takes a vectorized text and a vectorized image as input and outputs a number from 0 to 1 indicating how well the text description fits the image.
The data is provided in the folder /datasets/image_search/.
The file train_dataset.csv contains the information required for model training:
One image can have up to 5 text queries. A query id has the following format: <image file name>#<query_number>.
The folder train_images contains the images for model training;
The file CrowdAnnotations.tsv contains data on the correspondence between image and query obtained via crowdsourcing. The columns are as follows:
1, 2 — image file name and query id; 3 — share of annotators who confirmed the match; 4 — number of confirmations; 5 — number of rejections.
The file ExpertAnnotations.tsv contains data on the correspondence between image and query obtained from an expert questionnaire. The columns are as follows:
1, 2 — image file name and query id; 3, 4, 5 — expert ratings.
Experts rate the correspondence of image and query using the following scale:
1) — the image does not match the query;
2) — the query describes a few elements of the image, but overall does not match it;
3) — the query and image correspond to each other, with the details described;
4) — the query and image match completely.
The file test_queries.csv contains the data required for model testing: query id, query text, and image file name. One image can have up to 5 text queries. A query id has the following format: <image file name>#<query_number>.
The folder test_images contains the images for testing.
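The query id format described above can be illustrated with a small helper. This is a hypothetical sketch for clarity, not part of the project code; `parse_query_id` is an assumed name.

```python
def parse_query_id(query_id: str):
    """Split a '<image file name>#<query_number>' id into its parts."""
    image_name, _, number = query_id.rpartition('#')
    return image_name, int(number)

print(parse_query_id('2549968784_39bfbe44f9.jpg#2'))
# → ('2549968784_39bfbe44f9.jpg', 2)
```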
# libraries import
import re
import os
import nltk
import torch
import pandas as pd
import numpy as np
import transformers
from tqdm import notebook
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.metrics import mean_absolute_error, mean_squared_error
import psutil
from PIL import Image
import torchvision.models as t_model
import torch.nn as nn
from torchvision import transforms
from sklearn.model_selection import GroupShuffleSplit
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer, util
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings("ignore")
Data loading
DATA_PATH = 'initial_data/'
df_train = pd.read_csv(DATA_PATH + 'train_dataset.csv')
crowd_annotation = pd.read_table(DATA_PATH + 'CrowdAnnotations.tsv', header=None,
                                 names=['image', 'query_id', 'confirmed_percentage',
                                        'confimed_qty', 'rejected_qty'])
expert_annotation = pd.read_table(DATA_PATH + 'ExpertAnnotations.tsv', header=None,
                                  names=['image', 'query_id', 'expert_1',
                                         'expert_2', 'expert_3'])
df_test_queries = pd.read_csv(DATA_PATH + 'test_queries.csv', sep = '|')
df_test_images = pd.read_csv(DATA_PATH + 'test_images.csv')
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5822 entries, 0 to 5821 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 5822 non-null object 1 query_id 5822 non-null object 2 query_text 5822 non-null object dtypes: object(3) memory usage: 136.6+ KB
df_train.head()
| image | query_id | query_text | |
|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... |
for i in df_train.columns:
print(i, 'unique qty',len(df_train[i].unique()))
image unique qty 1000 query_id unique qty 977 query_text unique qty 977
Conclusion: the training set contains 1000 unique images and 977 unique query ids and query texts.
expert_annotation.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5822 entries, 0 to 5821 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 5822 non-null object 1 query_id 5822 non-null object 2 expert_1 5822 non-null int64 3 expert_2 5822 non-null int64 4 expert_3 5822 non-null int64 dtypes: int64(3), object(2) memory usage: 227.5+ KB
expert_annotation.head()
| image | query_id | expert_1 | expert_2 | expert_3 | |
|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | 1 | 1 | 1 |
| 1 | 1056338697_4f7d7ce270.jpg | 2718495608_d8533e3ac5.jpg#2 | 1 | 1 | 2 |
| 2 | 1056338697_4f7d7ce270.jpg | 3181701312_70a379ab6e.jpg#2 | 1 | 1 | 2 |
| 3 | 1056338697_4f7d7ce270.jpg | 3207358897_bfa61fa3c6.jpg#2 | 1 | 2 | 2 |
| 4 | 1056338697_4f7d7ce270.jpg | 3286822339_5535af6b93.jpg#2 | 1 | 1 | 2 |
total_qty_experts = 0
for i in ['expert_1', 'expert_2', 'expert_3']:
    print(i, 'ratings qty', len(expert_annotation[i]))
    total_qty_experts += len(expert_annotation[i])
print('total qty', total_qty_experts)
expert_1 ratings qty 5822 expert_2 ratings qty 5822 expert_3 ratings qty 5822 total qty 17466
expert_annotation['expert_1'].unique()
array([1, 2, 3, 4], dtype=int64)
Conclusion
the dataset has 5822 rows and 5 columns: image name, query id and three expert ratings;
the total quantity of expert ratings is 17466, i.e. 5822 per expert;
experts rate the image-query correspondence using a scale from 1 to 4, where:
1) — the image does not match the query;
2) — the query describes a few elements of the image, but overall does not match it;
3) — the query and image correspond to each other, with the details described;
4) — the query and image match completely.
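Later in the notebook the 1–4 consensus ratings are min-max scaled into the [0, 1] range. A minimal sketch of that scaling, with a hypothetical helper name:

```python
def scale_to_unit(rating, lo=1, hi=4):
    """Min-max scale an expert rating from [lo, hi] to [0, 1]."""
    return (rating - lo) / (hi - lo)

# ratings 1, 2, 3, 4 map to 0, 1/3, 2/3, 1
print([round(scale_to_unit(r), 6) for r in (1, 2, 3, 4)])
# → [0.0, 0.333333, 0.666667, 1.0]
```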
crowd_annotation.head()
| image | query_id | confirmed_percentage | confimed_qty | rejected_qty | |
|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | 1.0 | 3 | 0 |
| 1 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | 0.0 | 0 | 3 |
| 2 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | 0.0 | 0 | 3 |
| 3 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | 0.0 | 0 | 3 |
| 4 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | 0.0 | 0 | 3 |
crowd_annotation.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 47830 entries, 0 to 47829 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 47830 non-null object 1 query_id 47830 non-null object 2 confirmed_percentage 47830 non-null float64 3 confimed_qty 47830 non-null int64 4 rejected_qty 47830 non-null int64 dtypes: float64(1), int64(2), object(2) memory usage: 1.8+ MB
total_crowd_qty = 0
for i in ['confimed_qty','rejected_qty']:
total_crowd_qty += sum(crowd_annotation[i])
print('total_crowd_qty', total_crowd_qty)
total_crowd_qty 144860
crowd_annotation['confirmed_percentage'].unique()
array([1. , 0. , 0.33333333, 0.66666667, 0.25 ,
0.6 , 0.2 , 0.5 , 0.4 , 0.75 ,
0.16666667, 0.8 ])
The total quantity of ratings from the crowdsource is 144,860.
Each crowd rating either confirms or rejects the match between image and query. The share of confirmations among all ratings for each image-query pair will be used as the aggregate crowd rating.
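The confirmation share can be sketched as follows (a hypothetical helper, assuming the `confimed_qty`/`rejected_qty` counts from CrowdAnnotations.tsv):

```python
def crowd_share(confirmed_qty, rejected_qty):
    """Share of crowd annotators who confirmed the image-query match."""
    total = confirmed_qty + rejected_qty
    return confirmed_qty / total if total else None

print(crowd_share(3, 0), crowd_share(0, 3), round(crowd_share(1, 2), 4))
# → 1.0 0.0 0.3333
```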
df_test_queries.head()
| Unnamed: 0 | query_id | query_text | image | |
|---|---|---|---|---|
| 0 | 0 | 1177994172_10d143cb8d.jpg#0 | Two blonde boys , one in a camouflage shirt an... | 1177994172_10d143cb8d.jpg |
| 1 | 1 | 1177994172_10d143cb8d.jpg#1 | Two boys are squirting water guns at each other . | 1177994172_10d143cb8d.jpg |
| 2 | 2 | 1177994172_10d143cb8d.jpg#2 | Two boys spraying each other with water | 1177994172_10d143cb8d.jpg |
| 3 | 3 | 1177994172_10d143cb8d.jpg#3 | Two children wearing jeans squirt water at eac... | 1177994172_10d143cb8d.jpg |
| 4 | 4 | 1177994172_10d143cb8d.jpg#4 | Two young boys are squirting water at each oth... | 1177994172_10d143cb8d.jpg |
df_test_queries = df_test_queries.drop(columns = 'Unnamed: 0')
df_test_queries.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 500 entries, 0 to 499 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 query_id 500 non-null object 1 query_text 500 non-null object 2 image 500 non-null object dtypes: object(3) memory usage: 11.8+ KB
for i in df_test_queries.columns:
print(i, 'unique qty',len(df_test_queries[i].unique()))
query_id unique qty 500 query_text unique qty 500 image unique qty 100
The test dataset has 100 images and 500 text queries.
df_test_images.head()
| image | |
|---|---|
| 0 | 3356748019_2251399314.jpg |
| 1 | 2887171449_f54a2b9f39.jpg |
| 2 | 3089107423_81a24eaf18.jpg |
| 3 | 1429546659_44cb09cbe2.jpg |
| 4 | 1177994172_10d143cb8d.jpg |
df_test_images.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100 entries, 0 to 99 Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 100 non-null object dtypes: object(1) memory usage: 928.0+ bytes
The test dataset has 100 unique images.
(df_test_queries['image'].sort_values().drop_duplicates().reset_index(drop=1) ==
(df_test_images['image'].sort_values().drop_duplicates().reset_index(drop=1))).unique()
array([ True])
All image names in df_test_queries correspond to the image file names in df_test_images.
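The same check can be done order-insensitively by comparing the two collections as sets, which avoids the sort/reset-index dance. A toy sketch with made-up file names:

```python
# Toy example: set comparison ignores row order entirely.
images_from_queries = {'a.jpg', 'b.jpg', 'c.jpg'}
images_from_list = {'c.jpg', 'a.jpg', 'b.jpg'}
print(images_from_queries == images_from_list)  # → True
```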
To obtain the target value, the following steps are performed:
df_train_expert = pd.merge(df_train,expert_annotation,on=('image', 'query_id'),how='outer',indicator=True)
df_train_expert.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5822 entries, 0 to 5821 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 5822 non-null object 1 query_id 5822 non-null object 2 query_text 5822 non-null object 3 expert_1 5822 non-null int64 4 expert_2 5822 non-null int64 5 expert_3 5822 non-null int64 6 _merge 5822 non-null category dtypes: category(1), int64(3), object(3) memory usage: 324.2+ KB
# drop the merge indicator column
df_train_expert = df_train_expert.drop(columns = '_merge')
df_train_expert.head()
| image | query_id | query_text | expert_1 | expert_2 | expert_3 | |
|---|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 2 | 2 |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 4 | 4 | 4 |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 |
# consensus rating: if at least two experts agree, take that value,
# otherwise take the mean of the three ratings
cumulative = []
for i in df_train_expert.index:
    e1 = df_train_expert['expert_1'][i]
    e2 = df_train_expert['expert_2'][i]
    e3 = df_train_expert['expert_3'][i]
    if e1 == e2 or e1 == e3:
        cumulative.append(e1)
    elif e2 == e3:
        cumulative.append(e2)
    else:
        cumulative.append((e1 + e2 + e3) / 3)
df_train_expert['expert_cumulative'] = cumulative
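The consensus rule above (majority value if any two experts agree, otherwise the mean) can be sanity-checked in isolation; `consensus` is a hypothetical standalone restatement, not project code:

```python
def consensus(e1, e2, e3):
    """Majority vote over three ratings, falling back to the mean."""
    if e1 == e2 or e1 == e3:
        return e1
    if e2 == e3:
        return e2
    return (e1 + e2 + e3) / 3

print(consensus(1, 1, 2), consensus(1, 2, 2), round(consensus(1, 2, 4), 4))
# → 1 2 2.3333
```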
# min-max scale the consensus rating to the [0, 1] range
expert_max = df_train_expert['expert_cumulative'].max()
expert_min = df_train_expert['expert_cumulative'].min()
df_train_expert['expert_assesment'] = (
    (df_train_expert['expert_cumulative'] - expert_min) / (expert_max - expert_min))
# drop expert_cumulative as it is no longer needed
df_train_expert = df_train_expert.drop(columns = 'expert_cumulative')
df_train_expert.head()
| image | query_id | query_text | expert_1 | expert_2 | expert_3 | expert_assesment | |
|---|---|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 2 | 2 | 0.333333 |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 4 | 4 | 4 | 1.000000 |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 |
# add min and max expert ratings
df_train_expert['expert_min'] = df_train_expert[['expert_1', 'expert_2', 'expert_3']].min(axis=1)
df_train_expert['expert_max'] = df_train_expert[['expert_1', 'expert_2', 'expert_3']].max(axis=1)
df_train_expert.head()
| image | query_id | query_text | expert_1 | expert_2 | expert_3 | expert_assesment | expert_min | expert_max | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | 1 | 1 |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | 1 | 1 |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 2 | 2 | 0.333333 | 1 | 2 |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 4 | 4 | 4 | 1.000000 | 4 | 4 |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | 1 | 1 |
# flag rows where the expert ratings differ by more than 2
df_train_expert['check'] = np.where(
    df_train_expert['expert_max'] - df_train_expert['expert_min'] > 2, 'alert', 'ok')
# drop the min and max columns
df_train_expert = df_train_expert.drop(columns = ['expert_min','expert_max'])
df_train_expert.head()
| image | query_id | query_text | expert_1 | expert_2 | expert_3 | expert_assesment | check | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | ok |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | ok |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 2 | 2 | 0.333333 | ok |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 4 | 4 | 4 | 1.000000 | ok |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1 | 1 | 1 | 0.000000 | ok |
Display the flagged rows
drop_df = df_train_expert[df_train_expert['check']=='alert'].reset_index(drop = True)
drop_df
| image | query_id | query_text | expert_1 | expert_2 | expert_3 | expert_assesment | check | |
|---|---|---|---|---|---|---|---|---|
| 0 | 542179694_e170e9e465.jpg | 300577375_26cc2773a1.jpg#2 | An officer stands next to a car on a city stre... | 1 | 2 | 4 | 0.444444 | alert |
| 1 | 3388330419_85d72f7cda.jpg | 3358558292_6ab14193ed.jpg#2 | The room full of youths reacts emotionally as ... | 1 | 2 | 4 | 0.444444 | alert |
| 2 | 542317719_ed4dd95dc2.jpg | 542317719_ed4dd95dc2.jpg#2 | A smiling child slides down a slippery tube slide | 1 | 4 | 4 | 1.000000 | alert |
def to_drop_(df):
    # drop rows whose (image, query_id) pair is flagged in drop_df
    drop_pairs = set(zip(drop_df['image'], drop_df['query_id']))
    keep_mask = [(img, qid) not in drop_pairs
                 for img, qid in zip(df['image'], df['query_id'])]
    return df[keep_mask]
df_train_expert.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5822 entries, 0 to 5821 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 5822 non-null object 1 query_id 5822 non-null object 2 query_text 5822 non-null object 3 expert_1 5822 non-null int64 4 expert_2 5822 non-null int64 5 expert_3 5822 non-null int64 6 expert_assesment 5822 non-null float64 7 check 5822 non-null object dtypes: float64(1), int64(3), object(4) memory usage: 538.4+ KB
df_train_expert = to_drop_(df_train_expert)
df_train_expert.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5819 entries, 0 to 5821 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 5819 non-null object 1 query_id 5819 non-null object 2 query_text 5819 non-null object 3 expert_1 5819 non-null int64 4 expert_2 5819 non-null int64 5 expert_3 5819 non-null int64 6 expert_assesment 5819 non-null float64 7 check 5819 non-null object dtypes: float64(1), int64(3), object(4) memory usage: 409.1+ KB
crowd_annotation.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 47830 entries, 0 to 47829 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 47830 non-null object 1 query_id 47830 non-null object 2 confirmed_percentage 47830 non-null float64 3 confimed_qty 47830 non-null int64 4 rejected_qty 47830 non-null int64 dtypes: float64(1), int64(2), object(2) memory usage: 1.8+ MB
crowd_annotation = to_drop_(crowd_annotation)
crowd_annotation.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47829 entries, 0 to 47829 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 47829 non-null object 1 query_id 47829 non-null object 2 confirmed_percentage 47829 non-null float64 3 confimed_qty 47829 non-null int64 4 rejected_qty 47829 non-null int64 dtypes: float64(1), int64(2), object(2) memory usage: 2.2+ MB
# keep only the needed columns
df_train_expert_final = df_train_expert[['image','query_id','query_text','expert_assesment']]
df_train_expert_final.head()
| image | query_id | query_text | expert_assesment | |
|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.000000 |
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.000000 |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.333333 |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1.000000 |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.000000 |
df_train_ttl = pd.merge(df_train_expert_final,crowd_annotation,on=('image', 'query_id'),how='outer',indicator=True)
df_train_ttl.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 51320 entries, 0 to 51319 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 51320 non-null object 1 query_id 51320 non-null object 2 query_text 5819 non-null object 3 expert_assesment 5819 non-null float64 4 confirmed_percentage 47829 non-null float64 5 confimed_qty 47829 non-null float64 6 rejected_qty 47829 non-null float64 7 _merge 51320 non-null category dtypes: category(1), float64(4), object(3) memory usage: 3.2+ MB
For pairs rated by both experts and the crowd, the final rating is calculated as a weighted sum: 0.6 of the expert rating plus 0.4 of the crowd confirmation share.
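The weighted blend (weights 0.6/0.4 are taken from the calculation below; they are a project choice, not a standard) can be sketched with a hypothetical helper:

```python
def combined_rating(expert, crowd, w_expert=0.6, w_crowd=0.4):
    """Blend the expert score and the crowd confirmation share."""
    return w_expert * expert + w_crowd * crowd

print(combined_rating(1.0, 1.0), round(combined_rating(1/3, 0.0), 4))
# → 1.0 0.2
```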
df_train_ttl[df_train_ttl['_merge'] == 'both'].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2328 entries, 0 to 5817 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 2328 non-null object 1 query_id 2328 non-null object 2 query_text 2328 non-null object 3 expert_assesment 2328 non-null float64 4 confirmed_percentage 2328 non-null float64 5 confimed_qty 2328 non-null float64 6 rejected_qty 2328 non-null float64 7 _merge 2328 non-null category dtypes: category(1), float64(4), object(3) memory usage: 147.9+ KB
df_train_both = df_train_ttl[df_train_ttl['_merge'] == 'both']
df_train_both.head()
| image | query_id | query_text | expert_assesment | confirmed_percentage | confimed_qty | rejected_qty | _merge | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.000000 | 0.0 | 0.0 | 3.0 | both |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.333333 | 0.0 | 0.0 | 3.0 | both |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1.000000 | 1.0 | 3.0 | 0.0 | both |
| 5 | 3030566410_393c36a6c5.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.333333 | 0.0 | 0.0 | 3.0 | both |
| 9 | 3718964174_cb2dc1615e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.000000 | 0.0 | 0.0 | 3.0 | both |
df_train_both['text_match_img'] = (df_train_both['expert_assesment']*0.6 + df_train_both['confirmed_percentage']*0.4)
df_train_both_final = df_train_both[['image','query_id','query_text','text_match_img']]
df_train_both_final['text_match_img'].unique()
array([0. , 0.2 , 1. , 0.53333333, 0.33333333,
0.4 , 0.8 , 0.73333333, 0.66666667, 0.46666667,
0.6 , 0.86666667, 0.5 , 0.28 , 0.3 ,
0.1 , 0.7 , 0.56 , 0.72 , 0.13333333,
0.4 ])
df_train_both_final.head()
| image | query_id | query_text | text_match_img | |
|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
| 2 | 2447284966_d6bbdb4b6e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.2 |
| 3 | 2549968784_39bfbe44f9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 1.0 |
| 5 | 3030566410_393c36a6c5.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.2 |
| 9 | 3718964174_cb2dc1615e.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
For pairs rated only by experts, the previously calculated expert rating is used as-is.
df_train_ttl[df_train_ttl['_merge'] == 'left_only'].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3491 entries, 1 to 5818 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 3491 non-null object 1 query_id 3491 non-null object 2 query_text 3491 non-null object 3 expert_assesment 3491 non-null float64 4 confirmed_percentage 0 non-null float64 5 confimed_qty 0 non-null float64 6 rejected_qty 0 non-null float64 7 _merge 3491 non-null category dtypes: category(1), float64(4), object(3) memory usage: 221.7+ KB
df_only_experts = df_train_ttl[df_train_ttl['_merge'] == 'left_only']
df_only_experts.head()
| image | query_id | query_text | expert_assesment | confirmed_percentage | confimed_qty | rejected_qty | _merge | |
|---|---|---|---|---|---|---|---|---|
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 | NaN | NaN | NaN | left_only |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 | NaN | NaN | NaN | left_only |
| 6 | 3155451946_c0862c70cb.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 | NaN | NaN | NaN | left_only |
| 7 | 3222041930_f642f49d28.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 | NaN | NaN | NaN | left_only |
| 8 | 343218198_1ca90e0734.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 | NaN | NaN | NaN | left_only |
df_only_experts['text_match_img'] = df_only_experts['expert_assesment']
df_only_experts_final = df_only_experts[['image','query_id','query_text','text_match_img']]
df_only_experts_final['text_match_img'].unique()
array([0. , 0.33333333, 0.66666667, 1. ])
df_only_experts_final.head()
| image | query_id | query_text | text_match_img | |
|---|---|---|---|---|
| 1 | 1262583859_653f1469a9.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
| 4 | 2621415349_ef1a7e73be.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
| 6 | 3155451946_c0862c70cb.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
| 7 | 3222041930_f642f49d28.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
| 8 | 343218198_1ca90e0734.jpg | 2549968784_39bfbe44f9.jpg#2 | A young child is wearing blue goggles and sitt... | 0.0 |
For pairs rated only by the crowd, the crowd confirmation share is used as the rating.
df_train_no_text = df_train_ttl[df_train_ttl['_merge'] == 'right_only']
# extract unique query id - query text pairs
df_query_text = df_train[['query_id', 'query_text']].drop_duplicates()
# display the dataset before adding the query text
df_train_no_text.head()
| image | query_id | query_text | expert_assesment | confirmed_percentage | confimed_qty | rejected_qty | _merge | |
|---|---|---|---|---|---|---|---|---|
| 5819 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | NaN | NaN | 1.0 | 3.0 | 0.0 | right_only |
| 5820 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | NaN | NaN | 0.0 | 0.0 | 3.0 | right_only |
| 5821 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | NaN | NaN | 0.0 | 0.0 | 3.0 | right_only |
| 5822 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | NaN | NaN | 0.0 | 0.0 | 3.0 | right_only |
| 5823 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | NaN | NaN | 0.0 | 0.0 | 3.0 | right_only |
# fill in the query text by query id; ids missing from df_query_text get an empty string
query_text_map = dict(zip(df_query_text['query_id'], df_query_text['query_text']))
df_train_no_text['query_text'] = [query_text_map.get(i, '') for i in df_train_no_text['query_id']]
df_train_no_text[df_train_no_text['query_text'] == '']['query_text'].count()
1109
df_train_no_text = df_train_no_text[df_train_no_text['query_text'] != '']
df_train_no_text['text_match_img'] = df_train_no_text['confirmed_percentage']
df_train_no_text_final = df_train_no_text[['image','query_id','query_text','text_match_img']]
# display unique ratings
df_train_no_text_final['text_match_img'].unique()
array([1. , 0. , 0.33333333, 0.66666667, 0.25 ,
0.6 , 0.2 , 0.5 , 0.4 , 0.75 ,
0.16666667, 0.8 ])
# display the table
df_train_no_text_final.head()
| image | query_id | query_text | text_match_img | |
|---|---|---|---|---|
| 5819 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | A woman is signaling is to traffic , as seen f... | 1.0 |
| 5820 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | A boy in glasses is wearing a red shirt . | 0.0 |
| 5821 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | A young boy holds onto a blue handle on a pier . | 0.0 |
| 5822 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | A woman wearing black clothes , a purple scarf... | 0.0 |
| 5823 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | An older woman with blond hair rides a bicycle... | 0.0 |
# concatenate the three parts and reset the index
df_train_final = pd.concat(
    [df_train_no_text_final, df_only_experts_final, df_train_both_final],
    ignore_index=True)
df_train_final.head()
| image | query_id | query_text | text_match_img | |
|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | A woman is signaling is to traffic , as seen f... | 1.0 |
| 1 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | A boy in glasses is wearing a red shirt . | 0.0 |
| 2 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | A young boy holds onto a blue handle on a pier . | 0.0 |
| 3 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | A woman wearing black clothes , a purple scarf... | 0.0 |
| 4 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | An older woman with blond hair rides a bicycle... | 0.0 |
df_train_final.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50211 entries, 0 to 50210 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 image 50211 non-null object 1 query_id 50211 non-null object 2 query_text 50211 non-null object 3 text_match_img 50211 non-null float64 dtypes: float64(1), object(3) memory usage: 1.5+ MB
# count images per rating value
df_train_final_count = df_train_final.groupby(['text_match_img'])['image'].count()
# transform to df
df_train_final_count = pd.DataFrame(df_train_final_count)
# index reset
df_train_final_count = df_train_final_count.reset_index()
# calculation of percentage
df_train_final_count['percentage'] = round(df_train_final_count['image'] / df_train_final_count['image'].sum() * 100, 2)
# display the table and plot the distribution
df_train_final_count.sort_values(by='image', ascending=False)
| text_match_img | image | percentage | |
|---|---|---|---|
| 0 | 0.000000 | 42621 | 84.88 |
| 9 | 0.333333 | 2705 | 5.39 |
| 28 | 1.000000 | 1272 | 2.53 |
| 21 | 0.666667 | 1190 | 2.37 |
| 4 | 0.200000 | 843 | 1.68 |
| 11 | 0.333333 | 707 | 1.41 |
| 12 | 0.400000 | 190 | 0.38 |
| 16 | 0.533333 | 148 | 0.29 |
| 6 | 0.250000 | 88 | 0.18 |
| 10 | 0.333333 | 87 | 0.17 |
| 20 | 0.666667 | 77 | 0.15 |
| 27 | 0.866667 | 61 | 0.12 |
| 19 | 0.666667 | 55 | 0.11 |
| 26 | 0.800000 | 39 | 0.08 |
| 15 | 0.500000 | 26 | 0.05 |
| 5 | 0.200000 | 20 | 0.04 |
| 14 | 0.466667 | 19 | 0.04 |
| 24 | 0.733333 | 18 | 0.04 |
| 18 | 0.600000 | 12 | 0.02 |
| 13 | 0.400000 | 8 | 0.02 |
| 25 | 0.750000 | 7 | 0.01 |
| 2 | 0.133333 | 4 | 0.01 |
| 3 | 0.166667 | 3 | 0.01 |
| 17 | 0.560000 | 2 | 0.00 |
| 22 | 0.700000 | 2 | 0.00 |
| 23 | 0.720000 | 2 | 0.00 |
| 8 | 0.300000 | 2 | 0.00 |
| 7 | 0.280000 | 2 | 0.00 |
| 1 | 0.100000 | 1 | 0.00 |
plt.figure(figsize = (15,6))
plt.plot(df_train_final_count['text_match_img'], df_train_final_count['image'])
[<matplotlib.lines.Line2D at 0x1bac2b05060>]
In several countries where the product could be released there are restrictions on search services: it is forbidden to publish information (texts, videos, audio, pictures, etc.) containing a description, photo, video or voice of a child. A child is a person under 16 years of age.
The service under development must comply with the laws of all countries where it will be available. Therefore, when someone requests an image that cannot be shown by law, the following disclaimer must be displayed:
This image is unavailable in your country in compliance with local laws
However, in this project it is required to delete from the training sample the images that cannot be displayed due to these legal restrictions.
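A minimal, standard-library-only sketch of this restricted-content check (the notebook itself tokenizes with nltk below; the word list mirrors the one used there):

```python
import re

CENSORED = {'boy', 'boys', 'girl', 'girls', 'teenager', 'teenagers',
            'young', 'children', 'kid', 'kids', 'baby', 'child'}

def is_restricted(text):
    """Return True if the text contains any restricted word."""
    words = set(re.findall(r'[a-z]+', text.lower()))
    return bool(words & CENSORED)

print(is_restricted('A boy in glasses is wearing a red shirt .'))   # → True
print(is_restricted('A woman rides a bicycle down the street .'))  # → False
```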
# function to detect restricted content
def is_deprecated(text):
censored_list = ['boy', 'boys', 'girls','girl', 'teenagers', 'teenager','young', 'children', 'kid', 'kids', 'baby', 'child']
text = text.lower()
set_of_words = set(nltk.word_tokenize(text))
if set(censored_list) & set_of_words:
return True
else:
return False
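The same check can be applied to the whole column at once without per-row tokenization. A sketch of a hypothetical vectorized variant (`is_deprecated_vectorized` is not in the original notebook; the word-boundary regex stands in for tokenization):

```python
import pandas as pd

# word-boundary pattern, so e.g. 'kid' does not match inside 'kidding'
censored_pattern = r'\b(?:' + '|'.join(
    ['boy', 'boys', 'girls', 'girl', 'teenagers', 'teenager',
     'young', 'children', 'kid', 'kids', 'baby', 'child']) + r')\b'

def is_deprecated_vectorized(queries: pd.Series) -> pd.Series:
    # True where the lower-cased query contains any censored word as a whole word
    return queries.str.lower().str.contains(censored_pattern, regex=True)
```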
# applying the function
df_train_final["deprecated"] = df_train_final['query_text'].apply(is_deprecated)
df_train_final.head()
| image | query_id | query_text | text_match_img | deprecated | |
|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | A woman is signaling is to traffic , as seen f... | 1.0 | False |
| 1 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | A boy in glasses is wearing a red shirt . | 0.0 | True |
| 2 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | A young boy holds onto a blue handle on a pier . | 0.0 | True |
| 3 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | A woman wearing black clothes , a purple scarf... | 0.0 | False |
| 4 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | An older woman with blond hair rides a bicycle... | 0.0 | False |
# selecting the rows to be deleted
df_train_censored = df_train_final[(df_train_final['deprecated'] == True) & (df_train_final['text_match_img'] >=0.5)]
df_train_censored.head()
| image | query_id | query_text | text_match_img | deprecated | |
|---|---|---|---|---|---|
| 171 | 1096395242_fc69f0ae5a.jpg | 1096395242_fc69f0ae5a.jpg#2 | A young boy with his foot outstretched aims a ... | 0.666667 | True |
| 371 | 1122944218_8eb3607403.jpg | 1122944218_8eb3607403.jpg#2 | A baby wearing a white gown waves a Muslim flag . | 1.000000 | True |
| 493 | 1131932671_c8d17751b3.jpg | 2461616306_3ee7ac1b4b.jpg#2 | a boy jumps into the blue pool water . | 0.666667 | True |
| 539 | 114051287_dd85625a04.jpg | 114051287_dd85625a04.jpg#2 | A boy in glasses is wearing a red shirt . | 1.000000 | True |
| 660 | 1174525839_7c1e6cfa86.jpg | 1352410176_af6b139734.jpg#2 | A young girl balances on wooden pylons at the ... | 0.666667 | True |
# deletion of censored content
df_train_final.drop(df_train_censored.index.values, axis = 0, inplace=True)
df_train_final = df_train_final.reset_index(drop = True)
df_train_final.head()
| image | query_id | query_text | text_match_img | deprecated | |
|---|---|---|---|---|---|
| 0 | 1056338697_4f7d7ce270.jpg | 1056338697_4f7d7ce270.jpg#2 | A woman is signaling is to traffic , as seen f... | 1.0 | False |
| 1 | 1056338697_4f7d7ce270.jpg | 114051287_dd85625a04.jpg#2 | A boy in glasses is wearing a red shirt . | 0.0 | True |
| 2 | 1056338697_4f7d7ce270.jpg | 1427391496_ea512cbe7f.jpg#2 | A young boy holds onto a blue handle on a pier . | 0.0 | True |
| 3 | 1056338697_4f7d7ce270.jpg | 2073964624_52da3a0fc4.jpg#2 | A woman wearing black clothes , a purple scarf... | 0.0 | False |
| 4 | 1056338697_4f7d7ce270.jpg | 2083434441_a93bc6306b.jpg#2 | An older woman with blond hair rides a bicycle... | 0.0 | False |
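The index-based drop above can also be written as a single boolean mask, which avoids building the intermediate `df_train_censored` frame. A minimal sketch on a toy frame standing in for `df_train_final`:

```python
import pandas as pd

# toy frame with the same columns used by the filter
df = pd.DataFrame({
    'query_text': ['a boy jumps', 'a woman rides'],
    'text_match_img': [0.7, 1.0],
    'deprecated': [True, False],
})
# keep only rows that are NOT (deprecated AND rated >= 0.5):
# the same filter as the drop above, expressed as one mask
kept = df[~(df['deprecated'] & (df['text_match_img'] >= 0.5))].reset_index(drop=True)
```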
df_train_final.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49681 entries, 0 to 49680
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   image           49681 non-null  object
 1   query_id        49681 non-null  object
 2   query_text      49681 non-null  object
 3   text_match_img  49681 non-null  float64
 4   deprecated      49681 non-null  bool
dtypes: bool(1), float64(1), object(3)
memory usage: 1.6+ MB
After the deletion of the censored images the data contains 49681 records.
All queries mentioning censored words with a match rating of 0.5 or above were deleted.
The images will be vectorized with a pretrained ResNet-18 model truncated after the AdaptiveAvgPool2d layer; the final Linear classification layer is removed.
import torchvision.models as t_model

resnet = t_model.resnet18(pretrained = True)
# freeze the weights: the network is used only for feature extraction
for param in resnet.parameters():
    param.requires_grad_(False)
list(resnet.children())
[Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False),
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
ReLU(inplace=True),
MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False),
Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
),
Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
),
Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
),
Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
),
AdaptiveAvgPool2d(output_size=(1, 1)),
Linear(in_features=512, out_features=1000, bias=True)]
import torch.nn as nn

# drop the final Linear layer, keeping the 512-dimensional pooled features
modules = list(resnet.children())[:-1]
resnet_1 = nn.Sequential(*modules)
resnet_1.eval()
from torchvision import transforms

# standard ImageNet normalization expected by the pretrained ResNet
norm = transforms.Normalize(
    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    norm,
])
def img_to_vec(data, path):
    tensor_list = []
    tmp = 0
    with torch.no_grad():  # no gradients needed for feature extraction
        for i in data['image']:
            img = Image.open(path + i).convert('RGB')
            image_tensor = preprocess(img).unsqueeze(0)
            output_tensor = resnet_1(image_tensor).flatten()
            tensor_list.append(output_tensor.numpy())
            tmp += 1
            if tmp % 100 == 0:
                print(round(tmp / len(data['image']) * 100, 0), '%')
    return tensor_list
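`img_to_vec` pushes images through the network one at a time; stacking the preprocessed tensors into batches is usually much faster. A sketch of a batched variant, using a stand-in backbone (an AdaptiveAvgPool2d plus Flatten in place of the real `resnet_1`, to keep the example self-contained):

```python
import torch
import torch.nn as nn

# stand-in for the truncated ResNet; the real backbone would come from torchvision
backbone = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())

def batch_img_to_vec(image_tensors, model, batch_size=32):
    # image_tensors: list of preprocessed CHW tensors; returns an (N, dim) tensor
    model.eval()
    outputs = []
    with torch.no_grad():  # feature extraction only, no gradients
        for start in range(0, len(image_tensors), batch_size):
            batch = torch.stack(image_tensors[start:start + batch_size])
            outputs.append(model(batch))
    return torch.cat(outputs)
```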
Train img features / vectorization of train images
train_img_unique = df_train_final.copy()
train_img_unique = train_img_unique['image'].drop_duplicates().reset_index(drop=True)
train_img_unique = pd.DataFrame(train_img_unique, columns = ['image'])
train_img_unique_features = img_to_vec(train_img_unique, 'initial_data/train_images/')
10.0 % 20.0 % 30.0 % 40.0 % 50.0 % 60.0 % 70.0 % 80.0 % 90.0 % 100.0 %
# vectors to df
train_img_unique_features = pd.DataFrame(train_img_unique_features)
train_img_unique_features['image'] = train_img_unique
train_img = df_train_final.copy()
train_img = pd.DataFrame(train_img, columns = ['image'])
train_img_features_df = pd.merge(train_img,train_img_unique_features,on=('image'),how='outer')
train_img_features_df = train_img_features_df.drop(columns = 'image')
train_img_features_df
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.693940 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.444168 | 0.717752 | 0.294673 | 0.728979 | 1.153704 | 0.750852 | 1.196701 | 0.085007 | 1.056700 | 0.098157 |
| 1 | 0.693940 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.444168 | 0.717752 | 0.294673 | 0.728979 | 1.153704 | 0.750852 | 1.196701 | 0.085007 | 1.056700 | 0.098157 |
| 2 | 0.693940 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.444168 | 0.717752 | 0.294673 | 0.728979 | 1.153704 | 0.750852 | 1.196701 | 0.085007 | 1.056700 | 0.098157 |
| 3 | 0.693940 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.444168 | 0.717752 | 0.294673 | 0.728979 | 1.153704 | 0.750852 | 1.196701 | 0.085007 | 1.056700 | 0.098157 |
| 4 | 0.693940 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.444168 | 0.717752 | 0.294673 | 0.728979 | 1.153704 | 0.750852 | 1.196701 | 0.085007 | 1.056700 | 0.098157 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49676 | 1.836281 | 0.537826 | 0.549317 | 1.469246 | 0.357578 | 1.096264 | 0.521474 | 0.780281 | 2.023147 | 0.112635 | ... | 0.964279 | 1.641931 | 1.305999 | 1.212959 | 0.144280 | 2.034770 | 0.497294 | 1.919764 | 1.319937 | 0.450033 |
| 49677 | 1.836281 | 0.537826 | 0.549317 | 1.469246 | 0.357578 | 1.096264 | 0.521474 | 0.780281 | 2.023147 | 0.112635 | ... | 0.964279 | 1.641931 | 1.305999 | 1.212959 | 0.144280 | 2.034770 | 0.497294 | 1.919764 | 1.319937 | 0.450033 |
| 49678 | 1.836281 | 0.537826 | 0.549317 | 1.469246 | 0.357578 | 1.096264 | 0.521474 | 0.780281 | 2.023147 | 0.112635 | ... | 0.964279 | 1.641931 | 1.305999 | 1.212959 | 0.144280 | 2.034770 | 0.497294 | 1.919764 | 1.319937 | 0.450033 |
| 49679 | 1.836281 | 0.537826 | 0.549317 | 1.469246 | 0.357578 | 1.096264 | 0.521474 | 0.780281 | 2.023147 | 0.112635 | ... | 0.964279 | 1.641931 | 1.305999 | 1.212959 | 0.144280 | 2.034770 | 0.497294 | 1.919764 | 1.319937 | 0.450033 |
| 49680 | 1.836281 | 0.537826 | 0.549317 | 1.469246 | 0.357578 | 1.096264 | 0.521474 | 0.780281 | 2.023147 | 0.112635 | ... | 0.964279 | 1.641931 | 1.305999 | 1.212959 | 0.144280 | 2.034770 | 0.497294 | 1.919764 | 1.319937 | 0.450033 |
49681 rows × 512 columns
Test img features / vectorization of test images
test_img_features_df = img_to_vec(df_test_images, 'initial_data/test_images/')
100.0 %
test_img_features_df = pd.DataFrame(test_img_features_df)
test_img_features_df['image'] = df_test_images['image']
test_img_features_df
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | image | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.669249 | 0.004512 | 0.236133 | 0.882863 | 1.709107 | 0.125581 | 1.020888 | 0.168621 | 1.178943 | 0.865995 | ... | 0.749478 | 2.343132 | 0.512026 | 1.853229 | 1.385057 | 0.171342 | 0.792277 | 0.206550 | 1.448476 | 3356748019_2251399314.jpg |
| 1 | 0.949634 | 3.252454 | 0.698395 | 1.281548 | 0.327134 | 0.496780 | 0.039867 | 0.161799 | 0.413237 | 0.498238 | ... | 0.314517 | 1.681127 | 0.658190 | 2.667417 | 0.813721 | 1.553186 | 1.157109 | 1.100792 | 0.439287 | 2887171449_f54a2b9f39.jpg |
| 2 | 1.417580 | 0.973497 | 0.584019 | 0.330045 | 2.646095 | 0.159388 | 1.164999 | 0.870697 | 0.707525 | 1.298972 | ... | 0.402308 | 0.314999 | 0.221909 | 0.883103 | 0.438698 | 1.542921 | 0.381525 | 0.060120 | 0.403171 | 3089107423_81a24eaf18.jpg |
| 3 | 0.189787 | 1.876468 | 0.825718 | 0.621149 | 2.310625 | 0.565720 | 0.033543 | 0.556440 | 0.402366 | 4.919666 | ... | 0.729771 | 0.114588 | 0.915989 | 0.860376 | 1.078615 | 0.487137 | 1.073111 | 0.847126 | 0.025273 | 1429546659_44cb09cbe2.jpg |
| 4 | 0.443970 | 2.334722 | 0.006340 | 2.394032 | 0.022515 | 0.329775 | 1.684274 | 1.900405 | 0.215839 | 0.033497 | ... | 0.322904 | 1.906021 | 1.066521 | 0.000000 | 2.232223 | 2.022680 | 2.292787 | 0.793257 | 0.850167 | 1177994172_10d143cb8d.jpg |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1.208322 | 0.739464 | 0.050839 | 0.083080 | 0.920781 | 0.655538 | 0.497049 | 0.089066 | 1.798971 | 0.400781 | ... | 0.555691 | 0.512230 | 0.740154 | 1.496848 | 0.393386 | 0.572447 | 0.359475 | 0.984353 | 0.530105 | 2431120202_b24fe2333a.jpg |
| 96 | 1.183169 | 1.067630 | 1.575608 | 2.379034 | 1.213933 | 0.715755 | 0.162690 | 1.838291 | 0.662763 | 1.722738 | ... | 0.926513 | 3.265692 | 1.928150 | 1.340862 | 1.124190 | 1.901763 | 0.053659 | 1.678188 | 0.391409 | 2399219552_bbba0a9a59.jpg |
| 97 | 0.090710 | 1.954055 | 0.718825 | 0.601076 | 0.089057 | 0.158459 | 1.597549 | 0.435845 | 0.418540 | 0.376853 | ... | 0.656670 | 0.903572 | 0.796692 | 0.846823 | 0.718059 | 0.633133 | 1.676569 | 1.758489 | 0.122395 | 3091962081_194f2f3bd4.jpg |
| 98 | 0.193619 | 0.847930 | 0.737440 | 0.553259 | 0.133539 | 0.554890 | 0.057230 | 1.708564 | 1.222293 | 0.635215 | ... | 0.988613 | 0.946414 | 0.568584 | 1.705734 | 0.122703 | 0.776672 | 0.540184 | 2.390953 | 1.154196 | 2670637584_d96efb8afa.jpg |
| 99 | 0.094777 | 0.438139 | 2.430755 | 0.666201 | 0.313920 | 0.448235 | 0.630196 | 1.002802 | 1.046606 | 0.097983 | ... | 1.649578 | 0.877142 | 0.746205 | 1.250985 | 0.561963 | 1.122451 | 0.608868 | 1.804157 | 0.771301 | 2346402952_e47d0065b6.jpg |
100 rows × 513 columns
The following datasets are obtained:
1) train: 49681 × 512 image features;
2) test: 100 × 512 image features plus the image file name.
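Running the images through the CNN is the slowest step of the notebook, so it may be worth caching the resulting feature matrix to disk instead of recomputing it on every run. A minimal sketch (the file name `train_img_features.npy` is an assumption, and random data stands in for the ResNet vectors):

```python
import numpy as np

# stand-in feature matrix; in the notebook this would be the ResNet output
features = np.random.rand(4, 512).astype(np.float32)
np.save('train_img_features.npy', features)   # cache to disk once
loaded = np.load('train_img_features.npy')    # reload instead of re-running the CNN
```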
stop_words = set(stopwords.words('english'))
# function for obtaining a list of tokenized words from the query texts
def df_to_text(df):
    train_text_ = []
    for i in df['query_text']:
        processed_text = i.lower()
        processed_text = re.sub('[^a-zA-Z]', ' ', processed_text)
        processed_text = re.sub(r'\s+', ' ', processed_text)
        processed_text = nltk.word_tokenize(processed_text)
        filtered_words = [w for w in processed_text if w not in stop_words]
        train_text_.append(filtered_words)
    return train_text_
train_text_ = df_to_text(df_train_final)
test_text = df_to_text(df_test_queries)
from gensim.models import Word2Vec

# train Word2Vec on the combined train and test texts so no query word is out of vocabulary
_text_ = train_text_ + test_text
model_wv = Word2Vec(_text_, min_count=1, vector_size=300)
# function transforming each text into a vector: the mean of its word vectors
def text_to_vec(df):
    text_vector = []
    for j in df['query_text']:
        processed_text = j.lower()
        processed_text = re.sub('[^a-zA-Z]', ' ', processed_text)
        processed_text = re.sub(r'\s+', ' ', processed_text)
        processed_text = nltk.word_tokenize(processed_text)
        filtered_words = [w for w in processed_text if w not in stop_words]
        text_vector.append(filtered_words)
    vector_list = []
    for h in text_vector:
        v1 = model_wv.wv[h]
        vector_list.append(np.mean(v1, axis=0))
    return pd.DataFrame(vector_list)
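Because Word2Vec was trained with `min_count=1` on the combined train and test texts, `model_wv.wv[h]` never meets an unknown word here; queries typed by demo users could contain one, and the lookup would then raise a KeyError. A sketch of an OOV-safe averaging helper (`mean_vector` is hypothetical, and a plain dict stands in for `model_wv.wv`):

```python
import numpy as np

def mean_vector(tokens, vectors, dim=300):
    # average the vectors of known tokens; unseen words are skipped,
    # and an all-unknown query falls back to a zero vector
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)
```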
train_text_features = text_to_vec(df_train_final)
test_text_features = text_to_vec(df_test_queries)
train_text_features
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.223816 | 0.043280 | -0.078086 | 0.221208 | -0.001817 | -0.249398 | -0.274902 | 0.260870 | -0.007998 | -0.012579 | ... | 0.289845 | 0.128652 | 0.490216 | -0.059087 | 0.164893 | 0.016287 | -0.092402 | -0.374867 | 0.088822 | 0.064636 |
| 1 | -0.128972 | 0.093621 | -0.237345 | -0.089551 | 0.059418 | -0.434698 | 0.084664 | 0.379581 | 0.555430 | -0.072260 | ... | 0.414929 | 0.258088 | 0.404733 | 0.224961 | 0.241238 | -0.147861 | -0.232544 | -0.302437 | -0.257968 | 0.336313 |
| 2 | 0.143807 | 0.444889 | -0.324407 | -0.197311 | 0.073131 | -0.598512 | 0.209585 | -0.076982 | 0.168799 | -0.078752 | ... | 0.036598 | 0.248512 | 0.169012 | 0.203210 | 0.337754 | -0.040154 | 0.201049 | -0.237515 | 0.004690 | 0.255246 |
| 3 | -0.139792 | 0.177083 | 0.015613 | -0.184344 | -0.255431 | -0.502678 | 0.414699 | 0.654368 | 0.386166 | -0.275497 | ... | 0.357414 | 0.623727 | 0.509810 | 0.049272 | 0.461286 | -0.032211 | -0.158960 | -0.386488 | -0.450289 | 0.372033 |
| 4 | -0.105150 | 0.551294 | -0.398036 | 0.083901 | -0.034659 | -0.630937 | 0.002720 | 0.914875 | -0.032187 | 0.034839 | ... | 0.038328 | 0.454754 | 0.603128 | -0.060601 | 0.316756 | 0.221336 | -0.138929 | -0.532456 | -0.240612 | 0.058491 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49676 | -0.120095 | 0.041895 | -0.311475 | -0.180811 | -0.079752 | -0.743445 | 0.101740 | 0.564519 | -0.155389 | -0.220305 | ... | 0.180551 | 0.309318 | 0.293546 | 0.192794 | 0.397331 | 0.056426 | -0.076100 | -0.223015 | -0.052298 | 0.267127 |
| 49677 | 0.035119 | 0.012204 | 0.046422 | -0.108008 | -0.115551 | -0.194905 | 0.034466 | 0.596069 | 0.127441 | -0.217881 | ... | 0.025535 | 0.241457 | 0.129289 | -0.129272 | 0.123747 | 0.288869 | -0.260914 | -0.168603 | -0.174653 | 0.030511 |
| 49678 | -0.092819 | 0.047741 | -0.013888 | 0.120759 | 0.091784 | -0.138202 | -0.230959 | 0.161287 | -0.010925 | 0.131612 | ... | 0.065074 | -0.022302 | 0.218122 | -0.025385 | 0.270096 | -0.058928 | 0.071160 | -0.170461 | 0.281878 | -0.281370 |
| 49679 | 0.211670 | -0.080929 | -0.232407 | -0.074551 | 0.133177 | -0.439170 | -0.220965 | 0.452511 | 0.419318 | 0.016256 | ... | -0.039866 | 0.117169 | 0.195517 | -0.109854 | 0.339071 | 0.004756 | -0.249746 | -0.375533 | -0.016828 | 0.056922 |
| 49680 | -0.032769 | 0.286745 | 0.170310 | 0.043912 | -0.010254 | 0.217619 | -0.002461 | 0.200521 | -0.103633 | 0.101296 | ... | -0.149822 | 0.207247 | -0.322923 | 0.036414 | 0.320609 | 0.409033 | 0.187291 | 0.177368 | 0.426240 | -0.280256 |
49681 rows × 300 columns
len(train_text_features.columns)
300
text_col_names = ['text_col_' + str(i) for i in range(len(train_text_features.columns))]
train_text_features.columns = text_col_names
test_text_features.columns = text_col_names
train_text_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49681 entries, 0 to 49680
Columns: 300 entries, text_col_0 to text_col_299
dtypes: float32(300)
memory usage: 56.9 MB
train_text_features.head()
| text_col_0 | text_col_1 | text_col_2 | text_col_3 | text_col_4 | text_col_5 | text_col_6 | text_col_7 | text_col_8 | text_col_9 | ... | text_col_290 | text_col_291 | text_col_292 | text_col_293 | text_col_294 | text_col_295 | text_col_296 | text_col_297 | text_col_298 | text_col_299 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.223816 | 0.043280 | -0.078086 | 0.221208 | -0.001817 | -0.249398 | -0.274902 | 0.260870 | -0.007998 | -0.012579 | ... | 0.289845 | 0.128652 | 0.490216 | -0.059087 | 0.164893 | 0.016287 | -0.092402 | -0.374867 | 0.088822 | 0.064636 |
| 1 | -0.128972 | 0.093621 | -0.237345 | -0.089551 | 0.059418 | -0.434698 | 0.084664 | 0.379581 | 0.555430 | -0.072260 | ... | 0.414929 | 0.258088 | 0.404733 | 0.224961 | 0.241238 | -0.147861 | -0.232544 | -0.302437 | -0.257968 | 0.336313 |
| 2 | 0.143807 | 0.444889 | -0.324407 | -0.197311 | 0.073131 | -0.598512 | 0.209585 | -0.076982 | 0.168799 | -0.078752 | ... | 0.036598 | 0.248512 | 0.169012 | 0.203210 | 0.337754 | -0.040154 | 0.201049 | -0.237515 | 0.004690 | 0.255246 |
| 3 | -0.139792 | 0.177083 | 0.015613 | -0.184344 | -0.255431 | -0.502678 | 0.414699 | 0.654368 | 0.386166 | -0.275497 | ... | 0.357414 | 0.623727 | 0.509810 | 0.049272 | 0.461286 | -0.032211 | -0.158960 | -0.386488 | -0.450289 | 0.372033 |
| 4 | -0.105150 | 0.551294 | -0.398036 | 0.083901 | -0.034659 | -0.630937 | 0.002720 | 0.914875 | -0.032187 | 0.034839 | ... | 0.038328 | 0.454754 | 0.603128 | -0.060601 | 0.316756 | 0.221336 | -0.138929 | -0.532456 | -0.240612 | 0.058491 |
5 rows × 300 columns
train_text_features['image'] = df_train_final['image']
train_text_features['target'] = df_train_final['text_match_img']
for i in ['query_id','query_text','image']:
test_text_features[i] = df_test_queries[i]
test_text_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 303 entries, text_col_0 to image
dtypes: float32(300), object(3)
memory usage: 597.8+ KB
The following datasets are prepared:
1) train: 49681 × 300 text features plus the target and image file name;
2) test: 500 × 300 text features plus columns with the query id, query text and image file name.
Preparation of full datasets with features and targets for train and test samples
train
train_df_prepared = train_img_features_df.join(train_text_features)
train_df_prepared.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | text_col_292 | text_col_293 | text_col_294 | text_col_295 | text_col_296 | text_col_297 | text_col_298 | text_col_299 | image | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.69394 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.490216 | -0.059087 | 0.164893 | 0.016287 | -0.092402 | -0.374867 | 0.088822 | 0.064636 | 1056338697_4f7d7ce270.jpg | 1.0 |
| 1 | 0.69394 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.404733 | 0.224961 | 0.241238 | -0.147861 | -0.232544 | -0.302437 | -0.257968 | 0.336313 | 1056338697_4f7d7ce270.jpg | 0.0 |
| 2 | 0.69394 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.169012 | 0.203210 | 0.337754 | -0.040154 | 0.201049 | -0.237515 | 0.004690 | 0.255246 | 1056338697_4f7d7ce270.jpg | 0.0 |
| 3 | 0.69394 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.509810 | 0.049272 | 0.461286 | -0.032211 | -0.158960 | -0.386488 | -0.450289 | 0.372033 | 1056338697_4f7d7ce270.jpg | 0.0 |
| 4 | 0.69394 | 3.031836 | 2.916933 | 0.951898 | 0.936295 | 1.245117 | 0.826524 | 1.107943 | 0.169679 | 0.365382 | ... | 0.603128 | -0.060601 | 0.316756 | 0.221336 | -0.138929 | -0.532456 | -0.240612 | 0.058491 | 1056338697_4f7d7ce270.jpg | 0.0 |
5 rows × 814 columns
train_df_prepared.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49681 entries, 0 to 49680
Columns: 814 entries, 0 to target
dtypes: float32(812), float64(1), object(1)
memory usage: 156.0+ MB
For model training the train sample will be split into train and validation samples using GroupShuffleSplit, so that all queries for a given image fall into the same sample.
Neural networks will be trained and their scores compared in order to select the best model for further testing.
Data splitting into train and valid samples
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
train_indices, valid_indices = next(gss.split(X=train_df_prepared.drop(columns=['target','image']), y=train_df_prepared['target'], groups=train_df_prepared['image']))
train_df, valid_df = train_df_prepared.loc[train_indices], train_df_prepared.loc[valid_indices]
train_features, train_target = train_df.drop(columns=['target','image']), train_df['target']
valid_features, valid_target = valid_df.drop(columns=['target','image']), valid_df['target']
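The point of grouping by `image` is that no image appears in both samples, which would leak information through the duplicated image vectors. A small self-contained check of that property on a toy frame (the frame and column names here are stand-ins):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# toy frame: two query rows per image, as in the train data
df = pd.DataFrame({
    'feat': np.arange(10, dtype=float),
    'image': [f'img_{i // 2}.jpg' for i in range(10)],
    'target': np.linspace(0, 1, 10),
})
gss = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
tr_idx, va_idx = next(gss.split(df[['feat']], df['target'], groups=df['image']))
# no image name appears on both sides of the split
assert set(df.loc[tr_idx, 'image']).isdisjoint(set(df.loc[va_idx, 'image']))
```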
Linear regression model training
lr_model = LinearRegression()
lr_model.fit(train_features,train_target)
LinearRegression()
lr_prediction = lr_model.predict(valid_features)
rmse_lr_model = mean_squared_error(lr_prediction, valid_target, squared = False)
rmse_lr_model
0.18813374349341422
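As a sanity check (not part of the original notebook), the RMSE above can be compared with a constant baseline that always predicts the train-set mean; a useful model should clearly beat it. A minimal sketch (`baseline_rmse` is a hypothetical helper):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def baseline_rmse(train_target, valid_target):
    # RMSE of always predicting the train-set mean rating
    const = np.full(len(valid_target), np.mean(train_target))
    return np.sqrt(mean_squared_error(valid_target, const))
```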
Neural Networks training
train_features = torch.tensor(train_features.values)
valid_features = torch.tensor(valid_features.values)
valid_target = torch.tensor(valid_target.values)
train_target = torch.tensor(train_target.values)
# building of neural network 1
torch.manual_seed(1234)
input_size = 812
output_size = 1
class NeuralNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, output_size)
        self.act1 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.act1(x)
        return x
model_nn = NeuralNet(input_size,output_size)
# train of model 1
optimizer = torch.optim.Adam(model_nn.parameters(),lr=0.00001)
loss = torch.nn.MSELoss()
num_epochs = 25
for epoch in range(num_epochs):
    optimizer.zero_grad()
    preds = model_nn(train_features.float()).flatten()
    loss_value = loss(preds, train_target.float())
    loss_value.backward()
    optimizer.step()
    if (epoch % 2 == 0) or (epoch == num_epochs - 1):
        model_nn.eval()
        valid_preds_nn = model_nn(valid_features.float()).flatten()
        loss_preds = loss(valid_preds_nn, valid_target)
        print('valid_loss:', loss_preds)
        rmse_nn = mean_squared_error(valid_preds_nn.detach().numpy(), valid_target.detach().numpy(), squared = False)
        print('valid_rmse_x1000:', round(rmse_nn * 1000))
        model_nn.train()  # switch back to training mode after validation
valid_loss: tensor(0.1906, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 437
valid_loss: tensor(0.1846, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 430
valid_loss: tensor(0.1788, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 423
valid_loss: tensor(0.1732, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 416
valid_loss: tensor(0.1677, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 410
valid_loss: tensor(0.1625, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 403
valid_loss: tensor(0.1574, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 397
valid_loss: tensor(0.1525, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 390
valid_loss: tensor(0.1477, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 384
valid_loss: tensor(0.1432, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 378
valid_loss: tensor(0.1388, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 373
valid_loss: tensor(0.1346, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 367
valid_loss: tensor(0.1305, dtype=torch.float64, grad_fn=<MseLossBackward0>)
valid_rmse_x1000: 361
# building of neural network 2
torch.manual_seed(1234)
input_size = 812
hidden_size_1 = 512
hidden_size_2 = 32
output_size = 1
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size_1, hidden_size_2, output_size):
        super(NeuralNet, self).__init__()
        # note: no activation between fc1, fc2 and fc3, so the stack
        # is effectively a single linear map followed by the final ReLU
        self.fc1 = nn.Linear(input_size, hidden_size_1)
        self.fc2 = nn.Linear(hidden_size_1, hidden_size_2)
        self.fc3 = nn.Linear(hidden_size_2, output_size)
        self.act3 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.act3(x)
        return x
model_nn_2 = NeuralNet(input_size, hidden_size_1, hidden_size_2, output_size)
optimizer = torch.optim.Adam(model_nn_2.parameters(),lr=0.00005)
loss = torch.nn.MSELoss()
num_epochs = 80
for epoch in range(num_epochs):
    optimizer.zero_grad()
    preds = model_nn_2(train_features.float()).flatten()
    loss_value = loss(preds, train_target.float())
    loss_value.backward()
    optimizer.step()
    if (epoch % 10 == 0) or (epoch == num_epochs - 1):
        model_nn_2.eval()
        valid_preds_nn_2 = model_nn_2(valid_features.float()).flatten()
        loss_preds_2 = loss(valid_preds_nn_2, valid_target.float())
        print('valid_loss:', loss_preds_2)
        rmse_nn_2 = mean_squared_error(valid_preds_nn_2.detach().numpy(), valid_target.detach().numpy(), squared = False)
        print('valid_rmse:', round(rmse_nn_2 * 1000))
        model_nn_2.train()  # switch back to training mode after validation
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse: 201
# building of neural network 3
torch.manual_seed(1234)
input_size = 812
hidden_size_1 = 512
hidden_size_2 = 64
output_size = 1
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size_1, hidden_size_2, output_size):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size_1)
        self.dp1 = nn.Dropout(p=0.1)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size_1, hidden_size_2)
        self.dp2 = nn.Dropout(p=0.05)
        self.act2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_size_2, output_size)
        self.dp3 = nn.Dropout(p=0.05)
        self.act3 = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.dp1(x)
        x = self.act1(x)
        x = self.fc2(x)
        x = self.dp2(x)
        x = self.act2(x)
        x = self.fc3(x)
        x = self.dp3(x)
        x = self.act3(x)
        return x
model_nn_3 = NeuralNet(input_size, hidden_size_1, hidden_size_2, output_size)
optimizer = torch.optim.Adam(model_nn_3.parameters(),lr=0.00005)
loss = torch.nn.MSELoss()
num_epochs = 60
for epoch in range(num_epochs):
    optimizer.zero_grad()
    preds = model_nn_3(train_features.float()).flatten()
    loss_value = loss(preds, train_target.float())
    loss_value.backward()
    optimizer.step()
    if (epoch % 6 == 0) or (epoch == num_epochs - 1):
        model_nn_3.eval()
        valid_preds_nn_3 = model_nn_3(valid_features.float()).flatten()
        loss_preds_3 = loss(valid_preds_nn_3, valid_target.float())
        print('valid_loss:', loss_preds_3)
        rmse_nn_3 = mean_squared_error(valid_preds_nn_3.detach().numpy(), valid_target.detach().numpy(), squared = False)
        print('valid_rmse*1000:', round(rmse_nn_3 * 1000))
        model_nn_3.train()  # re-enable dropout for the next training steps
valid_loss: tensor(0.0404, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 201
valid_loss: tensor(0.0386, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 196
valid_loss: tensor(0.0368, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 192
valid_loss: tensor(0.0362, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 190
valid_loss: tensor(0.0357, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 189
valid_loss: tensor(0.0353, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 188
valid_loss: tensor(0.0351, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 187
valid_loss: tensor(0.0348, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 187
valid_loss: tensor(0.0347, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 186
valid_loss: tensor(0.0345, grad_fn=<MseLossBackward0>)
valid_rmse*1000: 186
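The loops above take one gradient step per epoch on the entire feature matrix; with larger data, mini-batch training via DataLoader is the usual approach. A minimal sketch on toy tensors (the sizes and the small network are stand-ins for the 812-feature data and models above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy data standing in for the real train features and targets
X = torch.randn(64, 8)
y = torch.rand(64)

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model.train()
for epoch in range(3):
    for xb, yb in loader:            # one optimizer step per mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb).flatten(), yb)
        loss.backward()
        optimizer.step()
model.eval()                          # disable training-only layers before validation
```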
Selection of best model
model_list = [lr_model,model_nn,model_nn_2,model_nn_3]
loss_preds_list = [rmse_lr_model,rmse_nn,rmse_nn_2,rmse_nn_3]
loss_preds_list
[0.18813374349341422, 0.361310782909795, 0.20091421231540807, 0.1858130876712399]
models_df = pd.DataFrame({'loss': loss_preds_list, 'model': model_list})
models_df.sort_values(by='loss')
| | loss | model |
|---|---|---|
| 3 | 0.185813 | NeuralNet(\n (fc1): Linear(in_features=812, o... |
| 0 | 0.188134 | LinearRegression() |
| 2 | 0.200914 | NeuralNet(\n (fc1): Linear(in_features=812, o... |
| 1 | 0.361311 | NeuralNet(\n (fc1): Linear(in_features=812, o... |
best_model = models_df.loc[models_df['loss'].idxmin(), 'model']
best_model
NeuralNet(
  (fc1): Linear(in_features=812, out_features=512, bias=True)
  (dp1): Dropout(p=0.1, inplace=False)
  (act1): ReLU()
  (fc2): Linear(in_features=512, out_features=64, bias=True)
  (dp2): Dropout(p=0.05, inplace=False)
  (act2): ReLU()
  (fc3): Linear(in_features=64, out_features=1, bias=True)
  (dp3): Dropout(p=0.05, inplace=False)
  (act3): ReLU()
)
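Once the best model is chosen, it can be persisted so the demo does not have to retrain it on every run. A minimal sketch with `torch.save`/`load_state_dict`; the tiny stand-in architecture and the file path are illustrative, not the project's actual `NeuralNet`:

```python
import os
import tempfile

import torch
from torch import nn

# hypothetical stand-in for the selected network; in the project this would be
# NeuralNet(input_size, hidden_size_1, hidden_size_2, output_size)
net = nn.Sequential(nn.Linear(4, 2), nn.ReLU(), nn.Linear(2, 1))

path = os.path.join(tempfile.gettempdir(), 'best_model.pt')
torch.save(net.state_dict(), path)  # persist the weights only

# later / elsewhere: rebuild the same architecture, then load the weights
restored = nn.Sequential(nn.Linear(4, 2), nn.ReLU(), nn.Linear(2, 1))
restored.load_state_dict(torch.load(path))
restored.eval()  # disable dropout for inference

x = torch.randn(3, 4)
assert torch.equal(net(x), restored(x))  # identical predictions
```

Saving the `state_dict` rather than pickling the whole module keeps the checkpoint independent of the class definition's file location.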
For model testing, it's required to write a function that takes the number of queries as input, finds the most suitable image for each query, and displays the results: the image, the query text, and the correspondence percentage.
After model testing it's required to analyze the results.
# demo testing function
def test(n):
    # create a dataset with n random text queries
    df_ten_test_queries = pd.DataFrame(test_text_features['query_text'].sample(n))
    df_ten_test_queries = df_ten_test_queries.reset_index(drop=True)
    n_images = len(test_img_features_df['image'])
    df_test_final = pd.DataFrame(index=range(len(df_ten_test_queries) * n_images),
                                 columns=['query_text', 'image'])
    # create a dataset of all possible pairs of the n queries and the images
    offset = 0  # renamed from `n`, which shadowed the function argument
    for i in range(len(df_ten_test_queries.index)):
        for j in test_img_features_df.index:
            df_test_final.loc[j + offset, 'query_text'] = df_ten_test_queries['query_text'][i]
            df_test_final.loc[j + offset, 'image'] = test_img_features_df['image'][j]
        offset += n_images  # was a hardcoded 100
    # transform the dataset to vectors
    df_test_final = pd.merge(df_test_final, test_img_features_df, on='image', how='inner')
    df_text_vectors = test_text_features.drop(columns=['image', 'query_id'])
    df_test_final = pd.merge(df_test_final, df_text_vectors, on='query_text', how='inner')
    # predict the percentage of image and text correspondence
    test_features = df_test_final.drop(columns=['image', 'query_text'])
    test_features = torch.tensor(test_features.values)
    best_model.eval()  # disable dropout for inference
    predictions = best_model(test_features.float()).flatten()
    df_test_final['result'] = predictions.detach().numpy()
    df_test_final = df_test_final[['query_text', 'image', 'result']]
    # select the image with the highest predicted rating for each query
    img_list = []
    prediction_list = []
    for i in df_ten_test_queries['query_text']:
        temp_df_test = df_test_final[df_test_final['query_text'] == i]
        best_row = temp_df_test[temp_df_test['result'] == temp_df_test['result'].max()]
        img_list.append(best_row['image'].values[0])
        prediction_list.append(best_row['result'].values[0])
    df_ten_test_queries['image'] = img_list
    df_ten_test_queries['prediction'] = prediction_list
    # censorship check
    df_ten_test_queries['deprecated'] = df_ten_test_queries['query_text'].apply(is_deprecated)
    # display the image, text and correspondence percentage
    for i in range(len(df_ten_test_queries['image'])):
        print('Text:', df_ten_test_queries['query_text'][i], '\n',
              'Prediction:', round(df_ten_test_queries['prediction'][i] * 100, 2), '%')
        if not df_ten_test_queries['deprecated'][i]:
            img = Image.open('initial_data/test_images/' + df_ten_test_queries['image'][i]).convert('RGB')
            display(img)
            print('\n')
        else:
            print('\n', '\033[1m' + '\033[91m' + 'This image is unavailable in your country in compliance with local laws', '\n')
            print('\u001b[0m' + '\n')
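The function above scores every query–image pair in a single forward pass; for a larger image set that tensor can get big. A hedged sketch of batched scoring (the helper name and default batch size are assumptions, not part of the project code):

```python
import torch

def predict_in_batches(model, features, batch_size=256):
    # score rows of `features` in chunks to bound memory use
    model.eval()  # disable dropout for inference
    outputs = []
    with torch.no_grad():
        for start in range(0, len(features), batch_size):
            chunk = features[start:start + batch_size].float()
            outputs.append(model(chunk).flatten())
    return torch.cat(outputs)
```

Inside `test`, `predictions = predict_in_batches(best_model, test_features)` would then replace the single full-batch call.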
test(10)
Text: A shirtless male looks to his right while water flows over him . Prediction: 18.83 %
Text: A group of people in anime cosplay costumes . Prediction: 15.5 %
Text: Women wearing red and black are clapping . Prediction: 14.32 %
Text: Closeup of a man at an event with formal attire . Prediction: 16.5 %
Text: A small boy wearing glasses stands on a rope and holds two ropes with his hands . Prediction: 13.62 %
This image is unavailable in your country in compliance with local laws
Text: A girl with Indian clothing on and henna on her hand goes through paperwork . Prediction: 14.23 %
This image is unavailable in your country in compliance with local laws
Text: A dog wrapped with straps is walking away from a red tray holding a bag . Prediction: 17.8 %
Text: a woman dumping water on a small child who is in a pool Prediction: 16.05 %
This image is unavailable in your country in compliance with local laws
Text: Two men standing near a metal structure in from of a brick wall . Prediction: 21.66 %
Text: A black and white dog with a green collar stands in front of a sign . Prediction: 16.96 %
The results of model testing are quite low; below is an alternative model proposed for the demo version of the search.
Applying an alternative model
# load the CLIP model
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('clip-ViT-B-32')
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
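The `util.cos_sim` call used below is just the cosine similarity between the image and text embeddings; a minimal numpy sketch of the score (the toy 3-d vectors are illustrative; real CLIP embeddings are 512-d):

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity: dot product of the vectors divided by their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy "embeddings" standing in for model.encode(...) outputs
img_emb = np.array([1.0, 0.0, 1.0])
text_emb = np.array([1.0, 1.0, 0.0])
score = cos_sim(img_emb, text_emb)  # ~0.5; higher means a better match
```

The score lies in [-1, 1], so the "percentages" printed by `test_clip` are scaled similarities, not calibrated probabilities.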
def test_clip(n):
    # create a dataset with n random text queries
    df_ten_test_queries = pd.DataFrame(test_text_features['query_text'].sample(n))
    df_ten_test_queries = df_ten_test_queries.reset_index(drop=True)
    n_images = len(test_img_features_df['image'])
    df_test_final = pd.DataFrame(index=range(len(df_ten_test_queries) * n_images),
                                 columns=['query_text', 'image'])
    # create a dataset of all possible pairs of the n queries and the images
    offset = 0  # renamed from `n`, which shadowed the function argument
    for i in range(len(df_ten_test_queries.index)):
        for j in test_img_features_df.index:
            df_test_final.loc[j + offset, 'query_text'] = df_ten_test_queries['query_text'][i]
            df_test_final.loc[j + offset, 'image'] = test_img_features_df['image'][j]
        offset += n_images  # was a hardcoded 100
    list_result = []
    # predict the degree of image and text correspondence
    for i in range(len(df_test_final['query_text'])):
        # encode an image
        img_emb = model.encode(Image.open(DATA_PATH + 'test_images/' + df_test_final['image'][i]))
        # encode the text description
        text_emb = model.encode(df_test_final['query_text'][i])
        # compute the cosine similarity
        cos_scores = util.cos_sim(img_emb, text_emb)
        list_result.append(cos_scores.detach().numpy()[0][0])
    df_test_final['result'] = list_result
    # select the image with the highest predicted rating for each query
    img_list = []
    prediction_list = []
    for i in df_ten_test_queries['query_text']:
        temp_df_test = df_test_final[df_test_final['query_text'] == i]
        best_row = temp_df_test[temp_df_test['result'] == temp_df_test['result'].max()]
        img_list.append(best_row['image'].values[0])
        prediction_list.append(best_row['result'].values[0])
    df_ten_test_queries['image'] = img_list
    df_ten_test_queries['prediction'] = prediction_list
    # censorship check
    df_ten_test_queries['deprecated'] = df_ten_test_queries['query_text'].apply(is_deprecated)
    # display the image, text and correspondence percentage
    for i in range(len(df_ten_test_queries['image'])):
        print('Text:', df_ten_test_queries['query_text'][i], '\n',
              'Prediction:', round(df_ten_test_queries['prediction'][i] * 100, 2), '%')
        if not df_ten_test_queries['deprecated'][i]:
            img = Image.open('initial_data/test_images/' + df_ten_test_queries['image'][i]).convert('RGB')
            display(img)
            print('\n')
        else:
            print('\n', '\033[1m' + '\033[91m' + 'This image is unavailable in your country in compliance with local laws', '\n')
            print('\u001b[0m' + '\n')
test_clip(10)
Text: Woman with glasses working at a sewing machine . Prediction: 34.13 %
Text: The heavy man is sitting on a bus and sleeping . Prediction: 32.29 %
Text: A black dog jumps into the air to get a treat from its owner . Prediction: 29.87 %
Text: A dog in a harness pulling a pink carrier behind it on snow . Prediction: 31.03 %
Text: a brown and white dog jumps on the sidewalk . Prediction: 29.94 %
Text: Grey horse wearing blue cover eating from a orange bucket held by a person in a green shirt . Prediction: 33.28 %
Text: a guy and a girl jumping up in the air Prediction: 28.84 %
This image is unavailable in your country in compliance with local laws
Text: Women play lacrosse . Prediction: 32.15 %
Text: While holding tight to the ball , the man in red socks is getting tackled . Prediction: 26.85 %
Text: A little girls waist high in sand Prediction: 26.95 %
This image is unavailable in your country in compliance with local laws
Conclusion of model testing and image search by query: